GROUP WORK - DEADLINE 15-Dec-23.
Please submit your final report using this form.

Radon, a natural radioactive gas, is a known carcinogen that can cause lung cancer, especially in high concentrations. In the United States, it’s estimated to lead to thousands of lung cancer deaths annually. Radon levels in homes across the country vary significantly, with some houses having dangerously high concentrations.

To pinpoint areas with high radon exposure, the Environmental Protection Agency conducted radon measurements in over 80,000 randomly selected houses nationwide. Our objective in analyzing this data was to estimate the range of radon levels in each of the approximately 3000 U.S. counties. This information would help homeowners decide whether to measure or reduce radon in their houses based on local conditions.

For the analysis, we organized the data hierarchically: houses within counties. If we were to consider multiple measurements within houses, it would create a three-level hierarchy of measurements, houses, and counties.

Creating a reproducible lab report

This time, you will need to use your own RStudio environment. You can also use one of the spaces you used for a previous lab, if you like!

  • Load the tidyverse, table1, GGally, broom, broom.mixed, lmerTest, faux, marginaleffects and the modelsummary packages, and any other packages you might deem useful.
  • Load the radon.rds data set into your workspace, and assign it to an object called radon.
radon <- read_rds("https://bit.ly/BMLR-Radon")
  • Knit your file to see that everything is working.
  • For each question, please add the text of the question, the code chunks and your answer.

Simulating the data

Simulating data is an important step when analyzing data, as it serves a range of purposes. First, it helps clarify our assumptions and expectations regarding the treatment effects we might realistically observe, the types of interactions we should consider, and the size and features of the data we need in order to establish the claims we want to make. We can work in a “clean” environment where variables are truly normal, there are no missing values and the data is “well behaved”. Once we grapple with an ideal world, we can slowly add complexities and see how they change the picture.

Second, simulating “fake” data is a general method for understanding how statistical methods behave under repeated sampling. By automating the sampling process, we can evaluate the model’s performance. We can define the parameters in our “fake” population and compare the model’s estimates and predictions with those true values, and we can include features to challenge our assumptions and improve our understanding. We simulate in order to test and validate our models before applying them to real data.
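
As a minimal sketch of this idea (base R only, with hypothetical parameter values), we can repeatedly simulate data from a known population model and check that a simple regression recovers the true slope on average:

```r
# Repeated simulation: does lm() recover a known slope on average?
set.seed(1)
true_slope <- 0.5
estimates <- replicate(200, {
  x <- rnorm(50)
  y <- 1 + true_slope * x + rnorm(50)  # known "population" model
  coef(lm(y ~ x))[["x"]]               # slope estimate for this sample
})
mean(estimates)  # should be close to the true value of 0.5
```

The same pattern scales up to multilevel simulations: define the true parameters, simulate many datasets, fit the model to each, and compare the distribution of estimates with the truth.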

Third, simulating fake data helps find issues in our code. If your model struggles to recover true parameters with large samples or low data variance, it could signal a coding or conceptual problem. Overlaying simulated data with assumed and fitted models can aid in identifying these issues.

Finally, simulating synthetic data is crucial when planning new studies or collecting data. It provides a basis for what to expect, ensuring a better understanding of potential outcomes.

In the first part of this lab we are going to replicate some of the ideas presented in a post written by Michael Clark. You may consider reading the post before continuing.

We are going to use the faux package to simulate nested data (observations nested within clusters). First, we define the number of clusters n_cluster and the number of observations in each cluster, obs_per_cluster. For example, we might want to have 20 clusters with 4 observations in each cluster (to keep things simple, let’s stick to a “balanced” data-set).

obs_per_cluster <-  4
n_cluster <-  20

Now, let’s generate some data, where (level 1) observed outcomes are related to predictors in the following manner:

\[ Y = a_i + b_i\cdot X + \epsilon \\ \text{where:}~~~ \epsilon \sim \mathcal{N}(0, \sigma^2) \]

But rather than being regular population parameters, \(a_i\) and \(b_i\) are random intercepts and slopes, such that each cluster has its own, unique intercept and slope. The random intercept is a level 2 random effect:

\[ a_i = \alpha_0 + u_i \\ \text{where:}~~~u_i \sim \mathcal{N}(0, \sigma_u^2)\\ \] And the random slope is a level 2 random effect:

\[ b_i = \beta_0 + v_i \\ \text{where:}~~~v_i \sim \mathcal{N}(0, \sigma_v^2)\\ \] To keep things simple, we will make sure that the random intercepts and slopes are independent, that is, the correlation between \(u_i, v_i\) is zero \(\rho_{u,v}=0\).
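
Substituting the two level-2 equations into the level-1 model gives the combined (reduced-form) model, which shows where each variance component enters:

\[ Y = \alpha_0 + \beta_0 \cdot X + u_i + v_i \cdot X + \epsilon \]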

As you can see, at this stage we have only one predictor at level one, \(X\), and no predictors at level 2. We also have to define six population parameters. For example, we can choose \(\alpha_0 = 1, \beta_0 = 0.5\), three standard deviations \(\sigma = 1, \sigma_u = 0.5, \sigma_v = 0.25\), and the correlation \(\rho_{u,v} = 0\).

Feel free to change these values if you would like to simulate different data.

# Fixed intercept
alpha_0 <-  ___ # choose a number you like, e.g. 1  

# Fixed slope
beta_0 <- ___ #  e.g. 0.5

# Residuals st.dev (level 1) 
sig  <-  ___ #  e.g. 1

# Random intercept st.dev (level 2) 
sig_u <-  ___ #  e.g. 0.5

# Random slope st.dev (level 2)
sig_v <-  ___ #  e.g. 0.25

# Independent random effects
cor <- ___ #  e.g. 0 (we assumed the random effects are independent)

Finally, here is the code that simulates our data in several steps. Run every step separately, and inspect the data frame (View(df)) to see its contents.

# Create a dataset that can be reproduced
set.seed(888)  
# step 1: create the units 
df <- add_random(
    cluster = n_cluster,  # first the clusters
    obs = obs_per_cluster # second, the observations
    ) 

# Print the result 
# try to figure out what just happened
df |> 
  print(n = 15)

# step 2: Add anything at level 2
# Here we have random intercept/slope 
df <- df |>
  add_ranef(
    "cluster",
    u_i = sig_u,
    v_i = sig_v,
    .cors = cor
    ) |> mutate(
      a_i = alpha_0 + u_i, 
      b_i = beta_0  + v_i
    )

# Print the result 
# try to figure out what just happened
df |> 
  print(n = 15)


# step 3: add anything at level 1
df <- df |>
  mutate(
    x       = rnorm(n_cluster * obs_per_cluster),
    epsilon = rnorm(n_cluster * obs_per_cluster, 0, sig),
    y = a_i + b_i * x + epsilon
    )


# Print the result 
# try to figure out what just happened
df |> 
  print(n = 20)

# For the models, we really only need 
# the cluster id, the y and the x

df  <- df |> select(cluster, x, y)

# Print the result 
# try to figure out what just happened
df |> 
  print(n = 20)

So now we have our final observations: the clusters, the predictor x and the outcome y. Before continuing, please go back and set obs_per_cluster and n_cluster to larger numbers, to make sure our models work.

We will then fit three types of models to the data:

  • Complete pooling: This model ignores the structure of the data, taking all observations to be uncorrelated and independent of one another. The model estimates just two fixed population parameters: the global intercept \(\alpha_0\) and the global slope \(\beta_0\), and the formula you will use in your linear regression model is y ~ x.
  • No pooling: This model expands on the previous one. In addition to the global intercept and slope, we add the (categorical) cluster variable and estimate a coefficient for each cluster. In effect, each cluster is associated with a cluster-specific intercept. Since we know that there is variation in the slopes, you may also test a model with an interaction between the clusters and the predictor, x. The formula you will use in your linear regression model is y ~ x + cluster for the model without the interaction and y ~ x + cluster + x:cluster for the model with the interaction.
  • Multilevel model (partial pooling): This model also expands on the complete pooling model to account for the correlated observations within the clusters. But instead of using fixed (dummy) variables, this model uses a random intercept (and, optionally, a random slope). The formula you will use for the random intercept model is y ~ x + (1 | cluster) and the formula for the random intercept and slope model is y ~ x + (x | cluster).
  1. Adjust the parameters, simulate your clustered data and present the results of the three types of models using the modelsummary package. Interpret your models and compare the estimates with each other, and with the six parameters used to simulate your data. Is there one single model that is superior to all the rest in terms of estimating the parameters? If so, which is it, and what makes it better? If not, which model works best for what purpose? In your answer, please pay attention to the standard errors and the Intraclass Correlation Coefficient (ICC) of your multilevel model(s). What does the ICC mean? How much does it deviate from what you would expect? For more about the ICC, please see below.
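
To make the first two approaches concrete, here is a minimal base-R sketch (with hypothetical parameter values, and using lm() only; the multilevel fits use lmerTest::lmer with the formulas given above) of the complete-pooling and no-pooling fits on a small simulated dataset:

```r
set.seed(42)
n_cluster <- 20
obs_per_cluster <- 4
cluster_id <- rep(seq_len(n_cluster), each = obs_per_cluster)
u <- rnorm(n_cluster, 0, 0.5)   # random intercepts, sigma_u = 0.5
v <- rnorm(n_cluster, 0, 0.25)  # random slopes,     sigma_v = 0.25
x <- rnorm(n_cluster * obs_per_cluster)
y <- (1 + u[cluster_id]) + (0.5 + v[cluster_id]) * x +
  rnorm(length(x))              # alpha_0 = 1, beta_0 = 0.5, sigma = 1
cluster <- factor(cluster_id)

m_pool   <- lm(y ~ x)            # complete pooling: 2 coefficients
m_nopool <- lm(y ~ x + cluster)  # no pooling: one intercept per cluster
length(coef(m_nopool))           # 21 = intercept + slope + 19 cluster dummies
```

Note how the no-pooling model spends one coefficient per cluster, which is exactly what inflates its standard errors when clusters are small.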

Further explanation and exploration: Use your imagination (and curiosity) to explore your simulations and your models. For example, you may want to run multiple simulations with the same parameters and see how much the results vary from one simulation to the next. You may want to change the sample size (the number of clusters and the number of observations per cluster), increase the parameters, or make some of them negative. Do the estimates vary a lot between simulations? Does increasing the sample size improve the accuracy of the estimates? Does it influence the standard errors? Is there one single model that is superior to all the rest in terms of estimating the parameters? If so, which model is it and in what way is it better? If not, which model works best for what purpose?

Calculating the ICC: The calculation of the ICC is straightforward when we only have random intercepts. Taking the random slopes into account complicates the situation because in that case, the ICC varies with the predictors. Because here the predictor is normally distributed around zero, the calculation of the ICC (at \(x=0\)) reduces to the regular form of the ICC, which need not take the random slope into account.
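
For a random-intercept model the ICC is \(\sigma_u^2 / (\sigma_u^2 + \sigma^2)\), the share of the total variance attributable to the clusters. A quick base-R check using the random-intercept SD and residual SD of the example below (\(\sigma_u = 1.5, \sigma = 1\)):

```r
# Population ICC for a random-intercept model:
# ICC = sigma_u^2 / (sigma_u^2 + sigma^2)
sig_u <- 1.5  # random-intercept SD (example value from the text)
sig   <- 1    # residual SD
icc <- sig_u^2 / (sig_u^2 + sig^2)
round(icc, 3)  # 0.692
```

The ICC estimated from a fitted model will differ from this population value, since the variance components themselves are estimated with error.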

For example, in the example below, data is simulated with 30 clusters, 20 observations per cluster, a global intercept of \(2\) and a global slope of \(1.5\), the variance of the (level-one) residual is \(1\), the variance of the random intercept and the random slope are \(1.5^2\) and \(1.8^2\) respectively, and the correlation coefficient is \(0.5\).

You can see that the no-pooling models have larger standard errors for the intercept, probably because the estimation is based on data from a single cluster (the reference cluster, with 20 observations and a small number of residual degrees of freedom). The no-pooling model with interactions suffers from relatively large standard errors for the same reason. Also, note that for the no-pooling model, the global intercept corresponds to the intercept of the reference cluster, so you are bound to get an estimate that is in effect the sum of the global intercept and the random intercept of the first cluster (\(=\alpha_0 + u_1\)).

You will not necessarily get the same estimations as in the table below! Every simulation can be different, therefore the estimates will be slightly different, leading to different results.

|                   | Simulation parameter | Complete pooling | No pooling* | No pooling + interaction* | Multilevel model | Multilevel model + random slopes |
|-------------------|----------------------|------------------|-------------|---------------------------|------------------|----------------------------------|
| (Intercept)       | \(\alpha_0 = 2\)     | 1.844 (0.108)*** | -1.557 (0.463)*** | -1.868 (0.232)*** | 1.844 (0.317)*** | 1.896 (0.321)*** |
| x                 | \(\beta_0 = 1.5\)    | 1.184 (0.115)*** | 1.196 (0.092)*** | -0.883 (0.231)*** | 1.195 (0.092)*** | 1.374 (0.366)*** |
| cluster02         |                      |                  | 2.940 (0.654)*** | 3.252 (0.327)*** | | |
| cluster03         |                      |                  | 3.945 (0.654)*** | 4.338 (0.327)*** | | |
| x:cluster02       |                      |                  |             | -0.957 (0.311)** | | |
| x:cluster03       |                      |                  |             | 3.590 (0.319)*** | | |
| x:cluster04       |                      |                  |             | 2.395 (0.351)*** | | |
| x:cluster05       |                      |                  |             | -0.702 (0.378)+ | | |
| x:cluster06       |                      |                  |             | 0.890 (0.303)** | | |
| SD (Intercept)    | \(\sigma_u = 1.5\)   |                  |             |             | 1.672 | 1.744 |
| SD (x)            | \(\sigma_v = 1.8\)   |                  |             |             |       | 1.988 |
| Cor (Intercept~x) | \(\rho = 0.5\)       |                  |             |             |       | 0.616 |
| SD (Observations) | \(\sigma = 1\)       |                  |             |             | 2.068 | 1.028 |
| Num.Obs.          |                      | 600              | 600         | 600         | 600   | 600   |
| R2                |                      | 0.151            | 0.506       | 0.884       |       |       |
| R2 Adj.           |                      | 0.150            | 0.480       | 0.871       |       |       |
| R2 Cond.          |                      |                  |             |             | 0.487 | 0.886 |
| AIC               |                      | 2873.1           | 2606.6      | 1794.9      | 2663.2 | 1978.3 |
| BIC               |                      | 2886.3           | 2747.3      | 2063.1      | 2680.8 | 2004.6 |
| ICC               |                      |                  |             |             | 0.4   | 0.9   |
| RMSE              |                      | 2.64             | 2.01        | 0.98        | 2.02  | 0.98  |

*Only a small sample of cluster-level fixed effects shown




Several approaches can be used to calculate the confidence intervals of random effects in multilevel models, each with its own strengths and limitations. Here are three of the most common methods:

  • Parametric Confidence Intervals: This approach is based on the assumption that the random effects follow a normal distribution. The standard errors of the random effects are estimated from the model and used to compute confidence intervals using the normal distribution. This method is straightforward and computationally efficient, but it may be inaccurate if the actual distribution of the random effects deviates significantly from normality. To calculate parametric confidence intervals, use the augment function from the broom.mixed package.

  • Bootstrap Confidence Intervals: Bootstrapping is a non-parametric method that involves repeatedly resampling the data with replacement and refitting the model each time. The resulting distribution of the random effects from these multiple fits is used to construct bootstrap confidence intervals. This method is more robust to non-normality than parametric methods, but it can be computationally intensive, especially for large datasets. To calculate bootstrap confidence intervals, use the predictInterval function from the merTools package.

  • Bayesian Credible Intervals: Bayesian multilevel models use a probability distribution to represent the uncertainty in the model parameters, including the random effects. Credible intervals are derived from the posterior distribution of the random effects, which reflects the combined information from the data and the prior distribution. Bayesian methods are more flexible and can handle complex hierarchical structures, but they may require more expertise and computational resources compared to frequentist methods.

Each method provides a different estimate of the confidence intervals of random effects in multilevel models, and there is no single “correct” approach. The choice of method should be guided by the specific characteristics of the data, the research question and the researcher’s philosophy. For simpler models and relatively large sample sizes, parametric methods may be sufficient. For more complex models or smaller datasets, bootstrap or Bayesian methods may be more appropriate. Additionally, the interpretation of confidence intervals for random effects depends on the type of inference being made. For example, conditional confidence intervals and marginal confidence intervals are used for different purposes.
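
As a small illustration of the parametric approach (with hypothetical numbers; in practice the estimate and standard error come from the fitted model), a 95% interval under the normality assumption is simply estimate ± z × SE:

```r
# Parametric 95% CI under a normality assumption (hypothetical values)
estimate <- 0.8  # a random-effect estimate
se       <- 0.3  # its standard error
ci <- estimate + c(-1, 1) * qnorm(0.975) * se
round(ci, 3)  # 0.212 1.388
```

The bootstrap and Bayesian approaches replace this closed-form step with resampling or posterior sampling, respectively, at a higher computational cost.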

  1. Use your models to create a caterpillar plot for the intercepts and lattice and/or spaghetti plots to show the features of your multilevel models. Describe and discuss the plots: what new light do they shed? What are the advantages or disadvantages of presenting the models using these plots?

To create the caterpillar plot, you can use the following code. Don’t forget to fill in the gaps. See the appendix below for more information about the caterpillar plots, their use and their interpretation.

# First, we need the list of clusters
clusters <- df |> 
  pull(cluster) |> 
  unique()

# Now let's fit our multilevel model
mdl_multilevel <- lmer(y ~ x + (x | cluster), df) 

# predictions of y for each cluster at x = 0
predict_y <- mdl_multilevel |> 
  ranef() |> 
  # Use the broom.mixed::augment function
  # to calculate the confidence intervals
  augment() |> 
  # convert to a tibble
  as_tibble() |> 
  # Include only the random intercept
  filter(variable == "(Intercept)") |> 
  # Now we order the results by ascending estimates
  arrange(estimate) |> 
  mutate(cluster = fct_inorder(level)) 

# Finally, let's plot our caterpillar!
predict_y |> 
  ggplot(aes(xmin = ___, xmax = ___, y = ___, x = ___)) + 
    geom_pointrange( ) + 
    labs(x = "Estimate of the intercept (Y_hat when X=0)", y = NULL)

To create the lattice plot and the spaghetti plot, you can use the code below. See the appendix below for more information about these different plots, their use and their interpretation.

predictions(mdl_multilevel) |>  
  ggplot(aes(___, ___, color = ___)) + 
  geom_point() + 
  geom_line(aes(y = ___)) + 
  # Use facets to create the lattice plot
  facet_wrap(~___)
  # Remove it to create the spaghetti plot

Radon in the home: a case study

Our aim in examining this data is to estimate how radon levels are distributed across the approximately 3000 counties in the United States. This information can empower homeowners to decide whether to test and address radon in their homes, using the best local knowledge available.

To conduct this analysis, we view the data hierarchically: houses (level one) within counties (level 2). If we were to delve deeper into multiple measurements within houses, we could create a three-tier hierarchy: measurements (level 1), houses (level 2), and counties (level 3). During the analysis, we consider an important factor: where the measurement was taken, whether in the basement or on the first floor. Radon, originating underground, can more readily enter homes built into the ground.

Additionally, at the county level, we used a measurement of soil uranium available for each county. This hierarchical model enables us to create a regression model for individual measurements while accommodating consistent, unexplained variations among the 3000 counties.

This section closely follows chapter 12 in Gelman and Hill’s influential book Data analysis using regression and multilevel/hierarchical models. You will find the chapter available as an optional reading (link also available through Brightspace). The chapter is available on Perusall to allow you to share code, discuss, or ask questions. I will be looking at the discussion from time to time, so if you need help, please do not hesitate to @tag me 🤙

In that chapter you will find the background for the study, and an explanation of the variables. The original dataset and the code used to produce the book is available here.

The dataset consists of a large number of radon measurements, so in this document we will include only the measurements from the state of Minnesota. Feel free to choose a different state if you like.

radon_mn <- read_rds("https://bit.ly/BMLR-Radon") |> 
  filter(state == "MN") |>
  # drop counties without measurements
  mutate(county = fct_drop(county))
  1. Create a table1 and use GGally::ggpairs to explore your data.

You will now visualize the estimates and confidence intervals for the average log radon across different counties, plotted against the (jittered) number of measurements taken in each county. We will use these estimates to compare two models: the no pooling model and the multilevel model with partial pooling, both without predictors. The horizontal line in each plot represents an estimate of the average radon level across all counties. This is a replication of figure 12.1 in the above-mentioned chapter.
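
The partial-pooling compromise can be made explicit. In the model without predictors, the multilevel estimate of county \(j\)’s average is approximately a precision-weighted average of the county sample mean \(\bar{y}_j\) and the overall mean \(\bar{y}\) (see chapter 12 of Gelman and Hill):

\[ \hat{\alpha}_j \approx \frac{\dfrac{n_j}{\sigma^2}\,\bar{y}_j + \dfrac{1}{\sigma_u^2}\,\bar{y}}{\dfrac{n_j}{\sigma^2} + \dfrac{1}{\sigma_u^2}} \]

Counties with few measurements (small \(n_j\)) are pulled toward the overall mean, while counties with many measurements stay close to their own sample mean.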

  1. Replicate the figures below and explain why the counties with fewer measurements have more variable estimates and higher standard errors. What is the main advantage and disadvantage of the no-pooling analysis, as illustrated in the plot on the left, compared with the multilevel model? Explain.

First, we need to associate each county with the number of observations.

radon_mn_count <- radon_mn |>
  count(county, name = "n.obs")

m.no_pool <- radon_mn |>
  lm(log.radon ~ 0 + county, data = _)

m.partial_pool <- radon_mn |>
  lmer(log.radon ~ (1|county), data = _)

# Use here the two models - first the fixed effects (no pooling)
# model, and then the multilevel (partial pooling) model.
___ |> 
  predictions(newdata = datagrid(county = unique)) |> 
  as_tibble() |> inner_join(radon_mn_count, by = "county") |> 
  select(county, estimate, conf.low, conf.high, n.obs) |> 
  ggplot(
    aes(
      x    = ___, 
      y    = ___, 
      ymin = ___, 
      ymax = ___
      )
    ) + 
  geom_pointrange(
    position = position_jitter(width = .5, height = 0)
    ) 
  1. In this exercise, you will replicate a set of models as shown below. This illustration is based on the ideas shown in figure 12.4 in the book. Please replicate this illustration with a twist: for example, you may choose a different set of counties than the one shown below or the one shown in the book, or you may try something different. Please explain the graphs. What do the dashed red and black lines show? What is the blue line? What determines whether the blue line would be more similar to the red line or the black line? How does this figure compare to the one shown in chapter 12 of the book?

The three lines show the predictions of the three models: the complete pooling model in red, the no pooling model (dashed black line) and the multilevel (partial pooling) model (blue line). You can see that the complete pooling model does not vary with the county level, showing the exact same intercept and slope in each cell of the lattice. The no pooling model is very sensitive to the data in the county: even when there are just two observations, as in the county of Mille Lacs, we see the no pooling model connecting the two dots. This shows how a no-pooling approach tends to overfit the data. The multilevel model finds a compromise between the no pooling and the complete pooling models. It introduces some bias, but avoids overfitting and is more useful for generalization.

Hints: To create the figure, you will need to plot the data itself as dots using the `geom_point()` layer, and then add the three models, one layer at a time. To simplify, I will provide the code for one model and you can add the rest yourselves.

# Fit the model (you will need to fit all three models)
mdl_multilevel <- radon_mn |>
  lmer(log.radon ~ first_floor + (first_floor|county), data = _)

# Collect the predictions
pred_multi <- mdl_multilevel |>
  predictions(
    newdata = datagrid(
      first_floor = c(0, 1),
      county = c("CLAY", "RICE", "SCOTT")
    )
  ) |> 
  select(county, first_floor, estimate, conf.low, conf.high) |> 
  as_tibble( )  

radon_mn |> 
  filter(county %in% c("CLAY", "RICE", "SCOTT")) |>
  ggplot(aes(x = first_floor, y = log.radon)) +
  geom_point() + 
  geom_line(data = pred_multi, aes(first_floor, estimate))  + 
  facet_wrap(~county) 
  1. In this exercise you are going to add a fixed effect at the county level (level 2). This illustration is based on figure 12.6 in the book, and it shows the estimated county level coefficients, plotted vs. county level uranium levels, along with estimated multilevel regression line at level 2. After replicating this graph, please explain it briefly.

To create the plot, we will need to fit two models: a simple linear model for the expected log-radon outcome as a function of the county’s uranium level, and a multilevel model with a random intercept for each county.

# The simple linear model for the general relationship between 
# Radon measurements and county level uranium
md.lm <- radon_mn |>
  lm(log.radon ~ u , data = _)

# The multilevel model for the radon as a function of level-2 uranium
md.lvl12 <- radon_mn |>
  lmer(log.radon ~ u +  (1|county), data = _)


# Predictions for the linear model
pred.lm <- predictions(
  md.lm,
  )

# predictions for the multilevel model
# stratified by the level of uranium
pred.lvl12 <- predictions(
  md.lvl12,
  by = "u"
)


pred.lvl12 |>
  ggplot(aes(x = u, xend = u, y = conf.low, yend = conf.high)) +
  geom_segment() +
  geom_point(aes(y = estimate)) +
  geom_line(
    data = pred.lm,
    aes(x = u, y = estimate),
    ) 
  1. Using the simulation techniques in questions 1 and 2, generate a synthetic dataset that would be similar to the radon data, in the sense that if you fitted the three models to your simulated dataset, you would get similar parameter estimations to the ones you got when fitting your models to the real Radon dataset. Explore your synthetic data using similar visualisations, and comment on similarities or differences between the Radon dataset and your simulated dataset.

Appendix

This appendix has several additional tables, figures and code, designed to help you develop your lab and explore the themes it introduces further.

Caterpillar plots

Caterpillar plots are graphical representations used to display random effects or individual-level estimates from the model. These plots help visualize the variation among the random effects across different levels of a categorical variable, such as groups or clusters.

Caterpillar plots depict the estimated values of the random effects for each level of the grouping variable. Each ‘caterpillar’ in the plot represents the estimated effect for a specific group, with its length indicating the uncertainty of the estimation.

Interpretation: The horizontal axis represents the estimated random effects or individual-level estimates. The vertical axis shows the different levels of the categorical variable (e.g., different groups or clusters). The ‘caterpillars’ (lines) show how the estimated random effects vary across these levels. Variability in the positions of the lines indicates the extent of differences between groups. Lines that vary widely in length or position suggest substantial variation in the random effects, or in the uncertainty, across the different levels of the categorical variable.

Caterpillar plots are helpful for understanding how much variation exists at the individual level (random effects) within different groups or clusters, providing insights into the impact of these group-level differences in mixed models. They’re valuable tools for visualizing and interpreting the distribution and variability of random effects.

The caterpillar plots for the intercepts are shown below, calculated as the predicted value of the intercept and its confidence interval for each cluster. Note that the confidence intervals for the no-pooling model are the widest, and that the estimates largely preserve their rank ordering when comparing the no-pooling fixed-effects model with the random-intercept multilevel model.

Spaghetti and lattice plots

Spaghetti and lattice plots are graphical representations that display individual trajectories or changes over time for each participant or subject in a study. These plots are particularly useful when examining longitudinal data or repeated measures where observations are taken at multiple time points for each individual. The x-axis typically represents time or the repeated measurements. The y-axis represents the outcome variable or response being measured. Each line (or “spaghetti strand”) represents the path or trajectory of one individual’s responses across time.

Spaghetti plots are valuable for understanding the variation and patterns within individual trajectories over time, providing a visual means to explore longitudinal data in multilevel models and identify potential relationships or trends within the dataset.